Quantitative Text Analysis of ‘The Guardian’ headlines

Executive summary

This report presents a quantitative text analysis of the headlines of “The Guardian” newspaper published between 18 October 2021 and 28 November 2021. It applies a sentiment analysis around the COP26 negotiations, which took place in Glasgow from 31 October to 12 November. This descriptive analysis aims to find possible changes in the opinions and/or political positioning of the media outlet.

Statement of contributions

The workload was distributed equally among the team members. The work included the following main tasks: webscraping, data cleaning, sentiment analysis, visualization and reporting. The structure of our project allowed us to divide the work equally on a rotation scheme, where each person started a task that was later commented on and improved by the two others.

  • Laura: Webscraping, Bing et al. dictionary, quanteda package, rotation on all other tasks

  • Kat: Code efficiency, visualization uniformity, repo structure, rotation on all other tasks

  • Nassim: Data cleaning, AFINN dictionary, interactive plots, rotation on all other tasks

Introduction

The 26th Conference of the Parties, which took place in Glasgow, represented a crucial moment for climate policy negotiations.

Most of the influential international leaders attended the event to discuss future global action on climate mitigation and adaptation, together with non-state actors and internationally renowned personalities. The occasion gained substantial media attention from all over the world, with peaks right before the start of the COP (the so-called ‘PreCOP’ events), during the Conference itself, and right after its conclusion.

Media outlets approach climate change differently to reflect their political positioning: the headlines, the highlights and the frequently mentioned topics differ based on political position.

We collect data from the headlines of a British newspaper to discover potential trends and changes in sentiment along the specific timeframe that goes from the weeks right before the Conference until the period right after it.

Motivation

As public policy students concerned with the climate negotiations, we are interested in investigating the opinions and attitudes expressed by media outlets in the above-mentioned periods of time. The main concern here is to unveil the attitude adopted by the media outlet, as the tone of the headlines influences public opinion. Do these trends change before and after the negotiations are held?

The relevance of our analysis lies in our curiosity about newspapers’ behavior around critical international occasions such as COP26. Do media outlets aim to inform people while adding a particular sentiment (which could also reflect a general feeling among the readers)? Or do they prefer to remain neutral and report factual events objectively? Answering this question with a data-driven approach could foster a deeper comprehension of the role of information and media in climate change developments.

To better understand these dynamics, we decided to analyze the sentiment of a single media outlet, “The Guardian”, published in the COP26 host country, the UK. According to the findings of YouGov, this newspaper leans to the left. Climate financing was a hot topic on the COP26 table: with the new American administration, world leaders were expecting changes in the level of climate financing and in the discussion of losses. Hence, choosing a non-neutral media outlet, which would have endorsed such a topic, can show more compelling results in terms of changes in political positioning with respect to the outcomes of the Conference, changes that can eventually be reflected in the headlines.

Research question

The principal objective of this research is to analyze trends in the attitude of the newspaper's headlines towards COP26 topics. The further questions related to this analysis fall into two macro areas.

The first area is both quantitative and qualitative in nature:
- Do the words used in the headlines of ‘The Guardian’ spread a specific sentiment?
- What is the ratio of positive to negative words in the collected data? Does this ratio change over time?
- Are there any interesting patterns among the most frequent words that could be inspected further?

The second area concerns a more specific analysis that also takes into account how such results could change when using different measurement instruments, in this case, dictionaries:
- Is the sentiment analysis consistent across different dictionaries?
- Do any differences and/or overlaps suggest a political interpretation? And are the results relevant for political interpretation?

Methods

The text analysis is performed mainly using three different packages: tidytext, tidytextdata and quanteda. These packages follow the tidyverse design philosophy. The main difference between these tools is that quanteda works with Corpus objects, as is typical of NLP workflows, while tidytext and tidytextdata can process texts in their character format. As these tools serve our research objectives well, the study used each one of them to emphasize specific outcomes and compare the results. Specifically, tidytext and tidytextdata are useful for building analyses and visualizations with dates in a less complicated way than with the quanteda document-level variables. The quanteda package is particularly useful for the targeted sentiment analysis we conducted, and it also made it possible to check the consistency of results with another dictionary, LSD2015. The keywords-in-context function was also used as an explorative tool.

Limitations

As a newspaper of the host country of the climate negotiations, The Guardian does not represent an ideal sample of headlines for deducing, through sentiment analysis, whether COP26 met expectations. Indeed, the results only show the changes in opinion for the specific political leaning that this outlet represents. However, the value of this project is to apply sentiment analysis procedures after scraping information from the web and to present the results to the user in an accessible format. It is therefore necessary to acknowledge the very limited scope of this analysis: its findings apply only to this specific and small sample.

A further limitation concerns the dates that were scraped from The Guardian website. Given the web-scraping strategy used, the most recent dates (December and end of November 2021) present some missing values caused by a heterogeneous format in the website pages. For demonstration purposes we simply dropped those missing values, further limiting the scope of the analysis.

Retrieving the data

The webscraping, cleaning and formatting section of the analysis can be found in the R script scraping_and_data_cleaning that is available in the repository.

The webscraping strategy adopted consists in downloading the headlines from multiple pages of the newspaper website by date (static webscraping).
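As a rough illustration of the "one page per date" idea, the sketch below enumerates the days of the study period and builds one address per day. It is written in Python for brevity and is not the R script from the repository. The URL template points at a placeholder domain and is purely hypothetical; the real page addresses and the download step are not shown here.

```python
from datetime import date, timedelta

# Study period stated in the report.
START = date(2021, 10, 18)
END = date(2021, 11, 28)

# Hypothetical placeholder template, NOT the newspaper's real URL scheme.
URL_TEMPLATE = "https://example.com/archive/{day}"

def daily_urls(start, end, template=URL_TEMPLATE):
    """Build one URL per calendar day, from start to end inclusive."""
    n_days = (end - start).days + 1
    return [
        template.format(day=(start + timedelta(days=i)).isoformat())
        for i in range(n_days)
    ]

urls = daily_urls(START, END)
print(len(urls), urls[0], urls[-1])
```

Each address would then be fetched and parsed in turn. Pages whose layout differs from the expected one are where missing values can appear.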

The formatting step includes transforming the dates into the correct format with lubridate and preparing the data for the quantitative text analysis with tidytext.

In this part, words regarding the main topic of the headlines (“cop26”, “glasgow”, “climate”, “change”) were expected to be very frequent, and were removed from the very beginning as customized stopwords since they do not contribute to the sentiment.
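The stopword step can be sketched as follows. This is a minimal Python illustration of the idea, not the tidytext code used in the project. The sample headlines are invented and the tokenizer is a simple regular expression assumed for the example. Only the four custom stopwords come from the report.

```python
import re
from collections import Counter

# Invented sample headlines, for illustration only.
headlines = [
    "Cop26: world leaders arrive in Glasgow as climate crisis deepens",
    "Climate change talks stall over net zero pledges",
    "World reacts to Glasgow climate deal",
]

# Custom stopwords named in the report.
CUSTOM_STOPWORDS = {"cop26", "glasgow", "climate", "change"}

def tokenize(text):
    """Lowercase a headline and split it into alphanumeric unigrams."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens, stopwords=CUSTOM_STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

tokens = [tok for h in headlines for tok in remove_stopwords(tokenize(h))]
word_counts = Counter(tokens)
print(word_counts.most_common(3))
```

Removing the topic words before counting keeps them from dominating the frequency table without adding anything to the sentiment scores.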

Explorative analysis

Through the exploration of the collected data, we aim at understanding which are the most frequent words and whether they could have a role in our investigation.

WordCloud and frequency table

Using a frequency table and an explorative WordCloud, we visualize the most frequent words. We identify ‘crisis’ as the most frequent word (other than the customized stopwords) used in the headlines during the COP26 period. Other very frequent words are listed in the table below.

word        count (n)
crisis      58
net         52
world       50
video       48
australia   41
johnson     40
happened    38
global      34
boris       33
emissions   32

Keywords in context for the most frequent word

Using a context table generated with the keyword-in-context function from the quanteda package, we could easily see that the word crisis is always used as part of the bigram “climate crisis”, which is the core subject of the climate negotiations. Because crisis counts as a negative word and would weigh on the overall sentiment analysis, we decided to drop it: the context in which it is used was shown to be informative rather than sentiment-related.
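The keyword-in-context check can be illustrated with a small Python sketch. It is not quanteda's implementation. The function name `kwic_rows`, the window size and the sample headlines are all assumptions made for this example.

```python
import re

def kwic_rows(texts, keyword, window=2):
    """Return (left context, keyword, right context) for every match."""
    rows = []
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                rows.append((left, tok, right))
    return rows

# Invented sample headlines, for illustration only.
headlines = [
    "Climate crisis: leaders urged to act",
    "How the climate crisis is reshaping politics",
]

rows = kwic_rows(headlines, "crisis", window=1)
for left, kw, right in rows:
    print(f"{left:>10} | {kw} | {right}")
```

If the left context is the same word in every row, the keyword is effectively part of a fixed bigram. That is the pattern described above for “climate crisis”.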

Sentiment analysis

The sentiment analysis applied to the collected headlines is conducted using a dictionary-based method. The three dictionaries used are:
- ‘Bing et al.’,
- ‘AFINN’,
- ‘Lexicoder Sentiment Dictionary’ (LSD2015)

The choice of these dictionaries is mainly based on common practice and on the objective of our research to check the sentiment of the headlines around the climate negotiations, quantify them and detect any potential patterns and the consistency of these results.

From the tidytext package, we use the ‘Bing et al.’ and the ‘AFINN’ dictionaries. These are general-purpose lexicons based on unigrams (single words). The first one classifies words as either negative or positive, while the second one scales the sentiment by assigning each word a value between -5 (very negative) and +5 (very positive).
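The difference between a binary and a graded lexicon can be sketched in a few lines of Python. The two dictionaries below are toy lexicons with made-up entries and scores. They are not the real Bing or AFINN word lists, and the code is not the tidytext workflow used in the project.

```python
# Toy lexicons with made-up entries; NOT the real Bing or AFINN word lists.
BINARY_LEXICON = {
    "hope": "positive",
    "win": "positive",
    "fail": "negative",
    "threat": "negative",
}
GRADED_LEXICON = {"hope": 2, "win": 4, "fail": -2, "threat": -2}

def binary_counts(tokens, lexicon=BINARY_LEXICON):
    """Count positive and negative tokens; unknown tokens are ignored."""
    counts = {"positive": 0, "negative": 0}
    for tok in tokens:
        label = lexicon.get(tok)
        if label is not None:
            counts[label] += 1
    return counts

def graded_score(tokens, lexicon=GRADED_LEXICON):
    """Sum the graded scores; unknown tokens contribute 0."""
    return sum(lexicon.get(tok, 0) for tok in tokens)

tokens = ["talks", "fail", "despite", "hope", "of", "a", "win"]
print(binary_counts(tokens))
print(graded_score(tokens))
```

A binary lexicon only says how many words fall on each side, while a graded one also says how strongly. This is why the two can paint slightly different pictures of the same headlines.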

From the quanteda package, the Lexicoder Sentiment Dictionary is a strong alternative, because it is particularly well suited to sentiment analysis of political communication (Young, L. & Soroka, S., 2012). This dictionary consists of 2,858 “negative” sentiment words and 1,709 “positive” sentiment words. The novelty of Young and Soroka's approach lies in a further set of 2,860 and 1,721 negations of negative and positive words, respectively. However, we did not find this additional set useful for our research purposes.

The initial processing of the words showed that “Boris Johnson” occupied a large share of the headlines of the British outlet, as he was the Prime Minister of the host country of the COP26 negotiations. Therefore, a separate study is conducted to assess the sentiment of the headlines related to the UK Prime Minister. This comparative study, based on the same dictionaries, gives readers an idea of the similarities and differences between the sentiment of the headlines related to Boris Johnson and the overall sentiment analysis.

Bing et al. visualization

Sentiment frequency plot

As explained above, ‘Bing et al.’ classifies words as positive or negative. Applying this dictionary to our dataset assigned sentiment values to 295 words, which are distributed over the examined time period. Two vertical dashed lines have been added to the plot to mark the period during which COP26 took place.

The figure clearly shows that the count of words with either a positive or a negative classification has increased during the negotiations period (October 31st – November 12th). Yet, through this graph it is quite difficult to deduce any trend about the general sentiment found in the headlines of ‘The Guardian’.

Most frequently occurring words per sentiment

To take a closer look at the different frequencies of the classified words, the following faceted barplot shows the most frequently occurring words per sentiment. The frequency threshold has been set to strictly more than 4 occurrences per day within each sentiment. It is noticeable that negative words seem to be more frequent than positive ones.

The first noticeable fact is that the word ‘protest’ appears twice among the negative words. This is because, due to our small dataset, the stemming part of the pre-processing was not carried out, so different forms of the same word are counted separately. Interestingly, ‘protest’ is considered a negative word, but it can be argued that its actual function here depends heavily on the context and on the political positioning with respect to the climate crisis, as many protests during COP26 aimed at demanding more effective climate action.

Sentiment interactive plots by date

After a first glance at the Bing et al. dictionary classification, it is worth taking a step back to see how this classification of words is distributed over time. The following interactive plots illustrate the development of negative and positive words along the examined timeframe. Both plots show the count of each type of word by date. The first one shows the negative/positive ratio over time including non-classified words, while the second one excludes them, to give a closer look at the changes in the counts of positive and negative classifications.
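The counts behind these two views can be sketched in Python as follows. The lexicon and the dated tokens are invented for illustration, and the function name `sentiment_by_date` is an assumption. The project itself produced these counts in R.

```python
from collections import Counter, defaultdict

# Toy lexicon with made-up entries; NOT the real Bing word list.
TOY_LEXICON = {"hope": "positive", "fail": "negative", "threat": "negative"}

def sentiment_by_date(dated_tokens, lexicon, keep_unclassified=True):
    """Count tokens per date and label (positive, negative, unclassified)."""
    counts = defaultdict(Counter)
    for day, tokens in dated_tokens:
        for tok in tokens:
            label = lexicon.get(tok, "unclassified")
            if label == "unclassified" and not keep_unclassified:
                continue
            counts[day][label] += 1
    return counts

# Invented tokens for two dates, for illustration only.
dated_tokens = [
    ("2021-10-31", ["leaders", "fail", "hope"]),
    ("2021-11-01", ["threat", "talks"]),
]

with_unclassified = sentiment_by_date(dated_tokens, TOY_LEXICON)
classified_only = sentiment_by_date(
    dated_tokens, TOY_LEXICON, keep_unclassified=False
)
print(dict(with_unclassified["2021-10-31"]))
print(dict(classified_only["2021-11-01"]))
```

Keeping the unclassified tokens shows how much of the text the dictionary leaves untouched. Dropping them makes the balance between positive and negative words easier to read.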

In this plot, one can clearly notice the surge of words in the headlines published while COP26 was actually happening, and that most of the words in ‘The Guardian’ headlines were not classified.

This surge in word counts and the prevalence of non-classified words, especially during the Conference, could have been expected, as during this period a local news outlet would cover the negotiations extensively and report on them in a rather non-subjective way.

In this second graph, it is clear that negativity is prevailing over positivity of the headlines, especially after COP26.

As observed in the barplot showing the most frequently occurring words, negative unigrams like ‘protest’, ‘limit’, ‘poor’, etc. are mentioned often. This fact leads us to ask the following questions:
- In the context of COP26, are protests negative?
- Don’t protest movements like Fridays for Future have a positive impact on the climate crisis?
- To what extent could these words be negative and biasing?

AFINN visualization

AFINN analysis for all headlines

To better understand how negative or positive the words used in The Guardian’s headlines are, we make use of the AFINN dictionary.

This lexicon assigns values from -5 to +5 to the words it classifies, ranging from very negative to very positive respectively.

The plotly package is used to visualize the values of negativity and positivity. This package offers the opportunity to compare data by hovering over the points (‘Compare data on hover’ button), making it possible to examine the word count by date for each sentiment value.

The results are not as bad as they appear in the previous plots. Looking at the extent to which these headlines are classified negatively or positively provides a clearer and more consistent picture of the collected data. The negative trend seen in the results of the Bing et al. dictionary makes more sense after seeing that the vast majority of the classified words fall in the -2 to +2 score range.

It is remarkable that not a single word was assigned a value of -5 or a value of +5. The reason could be that such dictionaries were validated using a combination of crowdsourcing, restaurant or movie reviews, or Twitter data. Therefore, words scored -5 or +5 may not fit the writing style of an official newspaper. In addition, headlines in particular do not often contain strong wording. It could be useful here to analyse words from the articles instead.

AFINN analysis for Boris Johnson headlines

The AFINN dictionary was applied in a similar way to headlines containing the ‘Boris Johnson’ bigram, which also turned out to be quite present in the outlet's headlines (see WordCloud). The following plot shows the targeted sentiment analysis for this bigram.

The last interactive plot shows a fairly well-balanced situation. The Prime Minister seemed to be on the more negative side of the spectrum before, during and after COP26. Yet Johnson seems to navigate well between positivity and negativity.

LSD2015 visualization

To validate the consistency of our findings from the analysis made with the previous two dictionaries, a third dictionary, this time incorporated in the quanteda package, is used to perform the same kind of analysis. To avoid repeating similar visualizations of the results, with the LSD2015 dictionary we decided to show the distribution of sentiments across all the headlines, compared with the Boris Johnson targeted analysis. A distribution function of the sentiments provides an indication of the homogeneity of the tone adopted by the newspaper.

Distribution of sentiments

The results of the LSD2015 sentiment analysis are fed into a logarithmic function that yields a final score of positivity over negativity for the words used. This score allows us to see how the sentiments are distributed.
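The report does not spell out the logarithmic function, so the sketch below assumes one common choice: the log of the ratio of positive to negative counts, with 0.5 added to each count so that zero counts do not break the logarithm. Both the formula and the function name `log_sentiment` are assumptions made for illustration, not necessarily what the project computed.

```python
import math

def log_sentiment(pos, neg, smoothing=0.5):
    """Assumed score: log((pos + smoothing) / (neg + smoothing)).

    Positive values mean more positive than negative words, negative
    values the opposite, and 0 means a perfect balance.
    """
    return math.log((pos + smoothing) / (neg + smoothing))

# Invented (positive, negative) counts for three headlines.
examples = [(2, 2), (3, 0), (0, 3)]
scores = [log_sentiment(p, n) for p, n in examples]
print(scores)
```

Under this assumption the score is symmetric: swapping the positive and negative counts flips its sign. This makes the two tails of a density plot directly comparable.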

Concluding remarks

The research conducted above has shown that the general sentiments are consistent across all dictionaries used in the analysis: in all cases, the negatives outweigh the positives. Even before COP26 had begun, The Guardian reflected a rather pessimistic outlook regarding the climate conference.

Quantitative interpretation:

  • Whilst the sentiments are not distributed uniformly across time, it can be said that at every point in time, The Guardian generally reflects a negative outlook on COP26.
  • Based on the AFINN dictionary analysis we observe a similar pattern: despite an initial surge in positive sentiments when COP26 began, these were still counteracted by an overwhelming number of headlines with negative sentiments.
  • The distributions of sentiments of all headlines and of those including “Boris Johnson” behave similarly, but the headlines mentioning Boris Johnson are less polarized, as indicated by lower peaks and higher troughs.
  • On top of that, the tails of the density plot show that amongst the negative words there were more extreme sentiments, compared to the most positive words.

Qualitative interpretation:

  • The three dictionaries presented a consistent result for the sentiment analysis of the scraped headlines.
  • Due to the project limitations, a further topic modeling could not be conducted to associate a prevalent sentiment to specific topics.
  • The headlines related to Boris Johnson (a politician) were shown to be less sharply sentimental, whereas the overall sentiment of the headlines shows a clear separation of the sentiments.
  • The trend of negativity in a one-month time window around COP26 did not change considerably before and after, which could be explained by the already increasing negativity around the topic.

We suspect that this overall negative tone could be partially due to the sensationalist nature of newspaper headlines. Quite naturally, a daunting, apocalyptic headline will generate more clicks and engagement, which could push journalists towards dramatized headlines. Ultimately, in our discussion of these results, we realised that The Guardian generally reflects our personal attitudes towards COP26 accurately. On the one hand, it is encouraging to see countries joining forces and finding solutions as to how we can best tackle climate change. On the other hand, we feel an increasing frustration that whilst we are taking steps in the right direction, we know that these steps are not enough.

Further research suggestions

As stated above, we feel that a potential drawback of our research is that it has been limited exclusively to headlines. Not only does this constrain the context in which negative or positive words appear, it could also reflect an exaggerated sentiment. So, to gain a more nuanced perspective on what the sentiments truly are, the scraping could be expanded to include the articles themselves, which would instantly and massively increase our data set. This expansion would then help us dig deeper into the topics of these articles and apply a targeted sentiment analysis that associates a sentiment with each topic.

Finally, it would have been interesting to consider the sentiment surrounding COP26 from other newspaper sources, either across the United Kingdom, or even more broadly speaking, across the world over a larger time period.

Resources

Young, L. & Soroka, S. (2012). Affective News: The Automated Coding of Sentiment in Political Texts. Political Communication, 29(2), 205–231. doi: 10.1080/10584609.2012.671234

Text Mining with R: A Tidy Approach

Quanteda Tutorials